Spectral dimensionality reduction for HMMs 
Abstract

Hidden Markov Models (HMMs) can be accurately approximated using co-occurrence frequencies of
pairs and triples of observations by using a fast spectral method (Hsu et al., 2009), in contrast to the usual
slow methods like EM or Gibbs sampling. We provide a new spectral method which significantly reduces
the number of model parameters that need to be estimated, and yields a sample complexity that does
not depend on the size of the observation vocabulary. We present an elementary proof giving bounds
on the relative accuracy of probability estimates from our model. (Corollaries show our bounds can be
weakened to provide either L1 bounds or KL bounds, which allow easier direct comparison to previous
work.) Our theorem uses conditions that are checkable from the data, instead of putting conditions on
the unobservable Markov transition matrix.



1 Introduction 



For many applications such as language modeling, it is useful to estimate Hidden Markov Models (HMMs)
(Rabiner, 1989) in which observations drawn from a large vocabulary are generated from a much smaller
hidden state. Standard HMM estimation techniques such as Gibbs sampling (Geman & Geman, 1984) and
EM (Baum et al., 1970; Dempster et al., 1977), although very widely used, can require some effort
to apply as they are often either slow or prone to getting stuck in local optima. Hsu, Kakade and Zhang, in
a path-breaking paper (Hsu et al., 2009), showed that HMMs can, in theory, be efficiently and accurately
estimated using closed form calculations on trigrams of observations which have been projected onto a low
dimensional space. Key to this approach is the use of singular value decomposition (SVD) on the matrix
of covariances between adjacent observations to learn a matrix $U$ that projects observations onto a space of
the same dimension as the hidden state. Perhaps surprisingly, co-occurrence statistics on unigrams, pairs,
and triples of observations are sufficient to accurately estimate a model equivalent to the original HMM.

The true hidden state itself cannot, of course, be estimated (it is not observed), but one can estimate a
linear transformation of the hidden state which contains sufficient information to give an optimal (in a sense
to be made precise below) estimate of the probability of any sequence of observations being generated by the
HMM (Hsu et al., 2009). The method of Hsu et al. (2009), and the extensions to it presented in this paper,
do not require any EM or Gibbs sampling, but only need an SVD on bigram observation counts. Since SVD
is an efficient method guaranteed to return the correct result in a known number of steps, this is a major
advantage over the iterative EM method.



Hsu et al. (2009) estimate a matrix of size $m \times v$ mapping between the dimension $v$ observation
space and a reduced dimension space of size $m$ (the dimension of the hidden state space). They also need
to estimate a tensor of size $vm^2$. We provide an alternate formulation that replaces their $vm^2$ tensor with
one of size $m^3$. Since the observation vocabulary size, $v$, is often much larger than the state space size ($v \gg m$), this
provides a significant reduction in model size and hence, as we show below, in sample complexity.



1.1 HMM set-up and notation 

We now introduce the notation and model used throughout our paper. 

Consider an HMM where $T$ is an $m \times m$ transition matrix on the hidden state, $O$ is a $v \times m$ emission
matrix giving the probabilities of hidden state $h = j$ emitting observation $x = i$, and $\pi$ is a vector of initial
state probabilities in which $\pi_i$ is the probability that $h_1 = i$. Jaeger (2000) showed that the joint
probability of a sequence of observations from this HMM is given by

$$\Pr(x_1, x_2, \ldots, x_t) = 1^T A_{x_t} A_{x_{t-1}} \cdots A_{x_1} \pi, \qquad (1)$$

where $A_x = T\,\mathrm{diag}(O^T \delta_x)$, $\delta_x$ is the unit vector of length $v$ with a single 1 in the $x$th position, and $\mathrm{diag}(w)$
creates a matrix with the elements of the vector $w$ on its diagonal and zeros everywhere else.
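As a concrete illustration of equation (1), the following minimal sketch evaluates $1^T A_{x_t} \cdots A_{x_1} \pi$ for a toy HMM with $m = 2$ states and $v = 3$ observations. The numerical values of `T`, `O` and `pi`, and the function names, are assumptions made for this example only; they do not come from the paper.

```python
# Toy HMM: all numbers are illustrative assumptions, not values from the paper.
T = [[0.7, 0.2],   # T[i][j] = Pr(h_{t+1} = i | h_t = j); each column sums to 1
     [0.3, 0.8]]
O = [[0.5, 0.1],   # O[x][j] = Pr(x | h = j); each column sums to 1
     [0.3, 0.2],
     [0.2, 0.7]]
pi = [0.6, 0.4]    # initial state distribution


def apply_operator(x, state):
    """Return A_x @ state, where A_x = T diag(O^T delta_x)."""
    m = len(state)
    weighted = [O[x][j] * state[j] for j in range(m)]  # diag(O^T delta_x) @ state
    return [sum(T[i][j] * weighted[j] for j in range(m)) for i in range(m)]


def sequence_probability(xs):
    """Pr(x_1, ..., x_t) = 1^T A_{x_t} ... A_{x_1} pi."""
    state = list(pi)
    for x in xs:              # A_{x_1} is applied first, A_{x_t} last
        state = apply_operator(x, state)
    return sum(state)         # left-multiplication by the all-ones vector
```

Summing `sequence_probability` over every sequence of a fixed length should give 1, which is a quick sanity check on the operator ordering.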

$A_x$ is called an 'observation operator', an idea dating back to multiplicity automata (Schützenberger,
1961; Carlyle & Paz, 1971; Fliess, 1974), and foundational in the theory of Observable Operator Models
(Jaeger, 2000) and Predictive State Representations (Littman et al., 2002). It is effectively a third order
tensor, giving the distribution vector over states at time $t+1$ as a function of the state distribution vector at
the current time $t$ and the current observation $\delta_{x_t}$. Since $A_x$ depends on the hidden state, it is not observable,
and hence cannot be directly estimated. But Hsu et al. (2009) showed that under certain conditions there
exists a fully observable representation of the observable operator model. We now present a novel, fully
reduced dimensional version of the observable representation.

1.2 The reduced dimension model 

Define a random variable $y_t = U^T \delta_{x_t}$, where $U$ has orthonormal columns and is a matrix mapping from
observations to the reduced dimension space.
We show below that

$$\Pr(x_1, x_2, \ldots, x_t) = c_\infty^T C_{y_t} C_{y_{t-1}} \cdots C_{y_1} c_1 \qquad (2)$$

holds, where

$$c_1 = \mu, \qquad c_\infty^T = \mu^T \Sigma^{-1}, \qquad C_y = C(y) = K(y)\,\Sigma^{-1},$$

and $\mu = E(y_1)$, $\Sigma = E(y_2 y_1^T)$, and $K(a) = E(y_3 y_1^T y_2^T)\,a$ are easy to estimate using the method of moments.^1

The matrix $U$ can be derived in several ways; Hsu et al. (2009) show that taking it to consist of the
left singular vectors of $P_{21}$ corresponding to the largest singular values gives good properties, where $P_{21}$ is
a matrix such that $[P_{21}]_{ij} = \Pr[x_2 = i, x_1 = j]$. The matrix $U$ and its properties will be discussed in more
detail below.

^1 Note that $K()$ is a tensor. When multiplied by a vector $a$, it produces a matrix. $K()$ is linear in each of the three reduced
dimension observations, $y_1$, $y_2$ and $y_3$.



Note that the model $(c_1, c_\infty, C(y))$ will be estimated using only trigrams. Once a model has been
learned, the probability of any observed sequence $(x_1, x_2, \ldots, x_t)$ can be computed using equation (2), or the
conditional probability $\Pr(x_t \mid x_1, x_2, \ldots, x_{t-1})$ of the next observation $x_t$ in a sequence can be computed by
$\Pr(x_t \mid x_{1:t-1}) = c_\infty^T C(y_t) c_t$ with recursive updates $c_{t+1} = C(y_t)c_t / (c_\infty^T C(y_t) c_t)$. The key term in the model
is thus $C(y)$, which can be viewed as a tensor which takes as input the current observation $x_t$ and produces
a matrix which maps (after normalization) from the current "hidden state estimate" $c_t$ to the next one $c_{t+1}$.
More precisely, $c_{t+1} = (U^T O)\,h_{t+1}(x_{1:t})$ is a linear function of the conditional expectation of the unobservable
hidden state $h_{t+1}(x_{1:t})$, which is the conditional probability vector over states at time $t + 1$.
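The sketch below checks the reduced model numerically on a toy HMM. It builds $\mu = U^T O \pi$, $\Sigma = U^T O\,T\,\mathrm{diag}(\pi)\,O^T U$ and $K(y) = U^T O\,T\,\mathrm{diag}(O^T U y)\,T\,\mathrm{diag}(\pi)\,O^T U$ from the true parameters (the population moments, not estimates), forms $c_1 = \mu$, $c_\infty^T = \mu^T \Sigma^{-1}$ and $C(y) = K(y)\Sigma^{-1}$, and compares the result with the operator form $1^T A_{x_t} \cdots A_{x_1} \pi$. The HMM values, the Gram-Schmidt choice of $U$, and all helper names are assumptions for illustration.

```python
import math

# Toy HMM (illustrative assumptions): m = 2 hidden states, v = 3 observations.
T = [[0.7, 0.2], [0.3, 0.8]]                 # columns sum to 1
O = [[0.5, 0.1], [0.3, 0.2], [0.2, 0.7]]     # columns sum to 1
pi = [0.6, 0.4]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def diag(w):
    return [[w[i] if i == j else 0.0 for j in range(len(w))] for i in range(len(w))]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def orthonormal_basis(A):
    """Gram-Schmidt on the columns of A, so that range(U) = range(A)."""
    basis = []
    for col in transpose(A):
        w = list(col)
        for u in basis:
            dot = sum(ui * ci for ui, ci in zip(u, col))
            w = [wi - dot * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return transpose(basis)                  # v x m, orthonormal columns

U = orthonormal_basis(O)
UtO = matmul(transpose(U), O)                # U^T O, an m x m matrix
mu = [row[0] for row in matmul(UtO, [[p] for p in pi])]
Sigma_inv = inv2(matmul(matmul(matmul(UtO, T), diag(pi)), transpose(UtO)))
c_inf = matmul([mu], Sigma_inv)[0]           # c_inf^T = mu^T Sigma^{-1}

def C(y):
    """C(y) = K(y) Sigma^{-1}."""
    OtUy = [row[0] for row in matmul(transpose(UtO), [[yi] for yi in y])]
    left = matmul(matmul(matmul(UtO, T), diag(OtUy)), T)
    K = matmul(left, matmul(diag(pi), transpose(UtO)))
    return matmul(K, Sigma_inv)

def matvec(M, w):
    return [sum(mij * wj for mij, wj in zip(row, w)) for row in M]

def reduced_probability(xs):
    """Equation (2): c_inf^T C(y_t) ... C(y_1) c_1, with y = U^T delta_x = row x of U."""
    c = list(mu)
    for x in xs:
        c = matvec(C(U[x]), c)
    return sum(a * b for a, b in zip(c_inf, c))

def conditional_probabilities(xs):
    """Pr(x_t | x_{1:t-1}) via c_{t+1} = C(y_t) c_t / (c_inf^T C(y_t) c_t)."""
    c, out = list(mu), []
    for x in xs:
        w = matvec(C(U[x]), c)
        p = sum(a * b for a, b in zip(c_inf, w))
        out.append(p)
        c = [wi / p for wi in w]
    return out

def hmm_probability(xs):
    """Equation (1): 1^T A_{x_t} ... A_{x_1} pi."""
    state = list(pi)
    for x in xs:
        state = matvec(T, [O[x][j] * state[j] for j in range(len(state))])
    return sum(state)
```

On this toy example the two routes agree up to floating-point error, and the product of the conditional probabilities reproduces the joint probability. With estimated rather than exact moments the agreement is only approximate.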

1.3 Comparison to Hsu et al. 



Hsu et al. (2009) derive a similar model which we state here for comparison.

$$\Pr(x_1, x_2, \ldots, x_t) = b_\infty^T B_{x_t} B_{x_{t-1}} \cdots B_{x_1} b_1 \qquad (3)$$

where

$$b_1 = U^T P_1, \qquad b_\infty = (P_{21}^T U)^{+} P_1, \qquad B_x = (U^T P_{3x1})(U^T P_{21})^{+},$$

and $[P_1]_i = \Pr[x_1 = i]$, $P_{21}$ as defined above, and $[P_{3x1}]_{ij} = \Pr[x_3 = i, x_2 = x, x_1 = j]$ are the frequencies of
unigrams, bigrams, and trigrams in the observed data. Note that the subscripts on $x$ refer to their positions
in trigrams of observations of the form $(x_1, x_2, x_3)$.

Our major modeling change will be to replace $B_x$ in equation (3) with the lower dimensional tensor $C(y)$,
which depends on the reduced dimension projection $y = U^T \delta_x$ instead of the unreduced $x$. The models are
easily related by the following lemma:



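Lemma 1. Assume the hidden state is of dimension $m$ and the rank of $O$ is also $m$. Then:

$$\Pr(x_1, x_2, \ldots, x_t) = 1^T A_{x_t} A_{x_{t-1}} \cdots A_{x_1} \pi \qquad (4)$$
$$\phantom{\Pr(x_1, x_2, \ldots, x_t)} = b_\infty^T B_{x_t} B_{x_{t-1}} \cdots B_{x_1} b_1 \qquad (5)$$
$$\phantom{\Pr(x_1, x_2, \ldots, x_t)} = c_\infty^T C_{y_t} C_{y_{t-1}} \cdots C_{y_1} c_1 \qquad (6)$$

where (5) requires $U^T O$ to be invertible, and (6) requires $\mathrm{range}(O) \subseteq \mathrm{range}(U)$.^2

Proof sketch: Jaeger (2000) showed (4), Hsu et al. (2009) showed (5), and (6) follows from
a telescoping product of the following items:

$$c_1 = (U^T O)\,\pi$$
$$c_\infty^T = 1^T (U^T O)^{-1}$$
$$C_y = C(y) = (U^T O)\, A_x\, (U^T O)^{-1}$$

where $y = U^T \delta_x$. More details are given in the supplemental material. □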



We improve on Hsu et al. (2009) in three ways:

1. By reducing the size of the matrix that is estimated, we can achieve a lower sample complexity. In
particular, our sample complexity does not depend on the size of the vocabulary nor on the frequency
distribution of the vocabulary.

2. Since the conditions given in Hsu et al. (2009) are in terms of the transition matrix $T$, they cannot
be checked. We instead focus on conditions that are checkable from the data.

3. Instead of using either an L1 error or a relative entropy error, we estimate the probabilities with relative
accuracy. In other words, we show that $|\hat p - p|/p$ is smaller than $\epsilon$. This is often a more useful bound
than just knowing that $|\hat p - p|$ is small. For example, it implies that computed conditional probabilities are
off by a relative error of roughly $2\epsilon$ at most. Both L1 and relative entropy errors can be computed from these bounds.



Our main theorem is weaker (as stated) than that of Hsu et al. (2009) in that we assume knowledge of $U$ rather
than estimating it from a thin SVD of $P_{21}$ as they do. Since the accuracy lost when estimating $U$ is identical
to that given in their paper, we will not discuss it here.

^2 If the matrix $U$ is formed from the left singular vectors of $P_{21}$ corresponding to nonzero singular values, then it will satisfy
this condition; see Hsu et al. (2009), Lemma 2.

[Figure 1 panels: "Our method" (left) and "Hsu et al.'s method" (right).]

Figure 1: Two HMMs with states $h_1$, $h_2$, and $h_3$ which emit observations $x_1$, $x_2$, and $x_3$. On the left, they
are further projected onto a lower dimensional space with observations $y_1$, $y_2$, $y_3$ by $U$, from which our core
statistic $C_y$ is computed based on $K = E(y_3 y_1^T y_2^T)$, which is an $(m \times m \times m)$ tensor. On the right, $x_1$ is
hit by $(P_{21}^T U)^{+}$ to make a lower dimensional $z_1$, $x_2$ is left unchanged and $x_3$ has its dimension reduced by
$U^T$. These terminal leaves are then used by Hsu et al. (2009) to estimate their $B_x$ via estimating $E(y_3 z_1^T \delta_{x_2}^T)$,
which is a tensor of size $(m \times m \times v)$.

2 Theorems 

The remainder of this paper presents one main theorem giving finite sample bounds for our reduced dimensional
HMM estimation method. We first derive these in terms of properties of the first three moments of
the reduced rank $y$'s, where $Y$ is the random variable which takes on values of the reduced rank observation
$y = U^T \delta_x$. We then convert those bounds to be in terms of the estimates, rather than the unobservable true
values, of the model.

Our general strategy for estimating $\Pr(x_t, x_{t-1}, \ldots, x_1)$ is via the method of moments. We have $\Pr()$
written in terms of $c_\infty$, $c_1$ and $C(y_t)$. Since each of these three items can be written in terms of moments of
the $y$'s, we can plug in the empirical moments to generate an estimate of $\Pr()$. Thus we can define:

$$\widehat{\Pr}(x_t, x_{t-1}, \ldots, x_1) = \hat c_\infty^T \hat C(y_t)\hat C(y_{t-1}) \cdots \hat C(y_1)\,\hat c_1 \qquad (7)$$

where

$$\hat c_1 = \hat\mu, \qquad \hat c_\infty^T = \hat\mu^T \hat\Sigma^{-1}, \qquad \hat C(y) = \hat K(y)\,\hat\Sigma^{-1},$$

where $\hat\mu$, $\hat\Sigma$ and $\hat K()$ are the empirical estimates of the first, second and third moments of the $Y$'s, namely

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} y_1^{(i)}, \qquad \hat\Sigma = \frac{1}{N}\sum_{i=1}^{N} y_2^{(i)} y_1^{(i)T}, \qquad \hat K(y) = \frac{1}{N}\sum_{i=1}^{N} y_3^{(i)} y_1^{(i)T} y_2^{(i)T} y,$$

where $y^{(i)}$ indexes the $N$ different independent observations of our data. These moments estimate the mean vector $\mu$, the variance
matrix $\Sigma$, and the skewness tensor $K()$.
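The empirical moments can be accumulated in a single pass over the observed triples. The sketch below assumes each triple has already been projected to reduced observations $(y_1, y_2, y_3)$, each a length-$m$ list; the function names and the storage layout `K[a][b][j]` (so that $[\hat K(y)]_{a,b} = \sum_j \hat K_{a,b,j}\, y_j$) are choices made for this illustration.

```python
def empirical_moments(triples):
    """Method-of-moments estimates from N triples (y1, y2, y3) of length-m lists.

    Returns (mu_hat, sigma_hat, k_hat) with
      mu_hat[a]       = mean of y1[a]
      sigma_hat[a][b] = mean of y2[a] * y1[b]
      k_hat[a][b][j]  = mean of y3[a] * y1[b] * y2[j]
    """
    n = len(triples)
    m = len(triples[0][0])
    mu = [0.0] * m
    sigma = [[0.0] * m for _ in range(m)]
    k = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for y1, y2, y3 in triples:
        for a in range(m):
            mu[a] += y1[a] / n
            for b in range(m):
                sigma[a][b] += y2[a] * y1[b] / n
                for j in range(m):
                    k[a][b][j] += y3[a] * y1[b] * y2[j] / n
    return mu, sigma, k


def apply_K(k, y):
    """Contract the third-moment tensor with y to get the m x m matrix K_hat(y)."""
    m = len(y)
    return [[sum(k[a][b][j] * y[j] for j in range(m)) for b in range(m)]
            for a in range(m)]
```

The cost is $O(N m^3)$ and the storage is $m + m^2 + m^3$ numbers, independent of the vocabulary size once the projection has been applied.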

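Definition 1. Define $\Lambda$ as the smallest element (in absolute value) of $\mu$, $\Sigma^{-1}$ and $K()$. In other words,

$$\Lambda = \min\{\min_i |\mu_i|,\ \min_{i,j} |\Sigma^{-1}_{ij}|,\ \min_{i,j,k} |K_{ijk}|\}$$

where $K_{ijk} = [K(\delta_j)]_{ik}$ are the elements of the tensor $K()$. Likewise we define the empirical version
as

$$\hat\Lambda = \min\{\min_i |\hat\mu_i|,\ \min_{i,j} |\hat\Sigma^{-1}_{ij}|,\ \min_{i,j,k} |\hat K_{ijk}|\}$$

Definition 2. Define $\sigma_m$ as the smallest singular value of $\Sigma$, and $\hat\sigma_m$ as the smallest singular value of $\hat\Sigma$.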

The parameters $\Lambda$ and $\sigma_m$ will be central to our analysis. Theorem 1 gives sample complexity bounds
on relative error in estimating the probability of a sequence being generated from an HMM as a function
of $\Lambda$ and $\sigma_m$, and the following lemmas reformulate those bounds into a more useful form in terms of their
estimates. As quantified and proved below, both $\Lambda$ and $\sigma_m$ must be "sufficiently large"; when they approach
zero one loses the ability to accurately estimate the model.

If $\sigma_m = 0$ then $U^T O$ will not be invertible, and one cannot infer the full information content of the
hidden state from its associated observation, violating the condition required in Hsu et al. (2009) for (5) to
hold. As $\sigma_m$ becomes increasingly close to zero, it becomes increasingly hard to identify the hidden state,
and more observations are required. Problems with small $\sigma_m$ are intrinsically difficult. As has been pointed
out by Hsu et al. (2009), some problems of estimating HMMs are equivalent to the parity problem (Terwijn,
2002). So for such data, our algorithm need not perform well. For parity-like problems, $\sigma_m$ is in fact zero
or close to it; hence we end up with a useless bound for such hard problems.

If $\Lambda$ is close to zero, then even if the absolute error is small, the relative error can be arbitrarily large,
as it involves dividing by the small true value of the parameter being estimated. Fortunately, as discussed
below, since $\Lambda$ depends on the somewhat arbitrary matrix $U$, one can shift $\Lambda$ away from zero by rotating
and rescaling $U$.

The proof of Theorem 1 is based on the idea that if we can estimate each term in $\mu$, $\Sigma^{-1}$ and $K()$ accurately
on an absolute scale (which will follow from basic central-limit-like theorems) then we can estimate them
on a relative scale if $\Lambda$ is large. Hence, our main condition is that $\Lambda$ is bounded away from zero. In fact,
if we take the usual statistical limit of having the sample size $N$ go to infinity while holding everything else
constant, then:



Pr(a;i,... ,Xt) 



1 



<^^^yki(W^ 



alAVN 



PT{xi,...,Xt) 

with probability greater than $1 - \delta$ when $N$ is large enough.

The following theorem gives the finite sample bound in terms of a sample complexity: 



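Theorem 1. Let $X_t$ be generated by an $m \ge 2$ state HMM. Suppose we are given a $U$ which has the property
that $\mathrm{range}(O) \subseteq \mathrm{range}(U)$ and $|U_{ij}| \le 1$. Suppose we use equation (7) to estimate the probability based on
$N$ independent triples. Then

$$N \ge \frac{128\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 \Lambda^2 \sigma_m^4}\,\log\frac{2m}{\delta} \qquad (8)$$

implies that

$$1 - \epsilon \le \frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} \le 1 + \epsilon$$

holds with probability at least $1 - \delta$.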
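To get a feel for the sample complexity, the following sketch evaluates the right-hand side of (8), read here as $N \ge 128 m^2 \log(2m/\delta) / \big((\sqrt[2t+3]{1+\epsilon} - 1)^2 \Lambda^2 \sigma_m^4\big)$. The function name and the argument values used in the usage note are hypothetical.

```python
import math


def required_samples(m, t, eps, delta, lam, sigma_m):
    """Right-hand side of (8): number of independent triples sufficient for a
    relative error of eps on sequences of length t, with probability 1 - delta."""
    r = (1.0 + eps) ** (1.0 / (2 * t + 3)) - 1.0   # (2t+3)-th root of (1+eps), minus 1
    return 128.0 * m ** 2 * math.log(2.0 * m / delta) / (r ** 2 * lam ** 2 * sigma_m ** 4)
```

For instance, `required_samples(2, 3, 0.1, 0.05, 0.1, 0.5)` can be compared against the same call with `sigma_m=0.25`: halving $\sigma_m$ multiplies the bound by 16, while halving $\Lambda$ multiplies it by 4. The vocabulary size $v$ does not appear.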



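Before proceeding with the proof of this theorem, we present and prove two corollaries that correspond
directly to Theorems 6 and 7 of Hsu et al. (2009).

Corollary 1. Assume Theorem 1 holds; then with probability at least $1 - \delta$,

$$\sum_{x_1, \ldots, x_t} \left|\widehat{\Pr}(x_1, \ldots, x_t) - \Pr(x_1, \ldots, x_t)\right| \le \epsilon$$

Proof of Corollary 1: We have

$$1 - \epsilon \le \frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} \le 1 + \epsilon$$
$$\Rightarrow \left|\frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} - 1\right| \le \epsilon$$
$$\Rightarrow \left|\widehat{\Pr}(x_1, \ldots, x_t) - \Pr(x_1, \ldots, x_t)\right| \le \epsilon \Pr(x_1, \ldots, x_t)$$
$$\Rightarrow \sum_{x_1, \ldots, x_t} \left|\widehat{\Pr}(x_1, \ldots, x_t) - \Pr(x_1, \ldots, x_t)\right| \le \epsilon \sum_{x_1, \ldots, x_t} \Pr(x_1, \ldots, x_t)$$
$$\Rightarrow \sum_{x_1, \ldots, x_t} \left|\widehat{\Pr}(x_1, \ldots, x_t) - \Pr(x_1, \ldots, x_t)\right| \le \epsilon \qquad \Box$$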

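Corollary 2. Assume Theorem 1 holds; then we have

$$KL\big(\Pr(x_t \mid x_1, \ldots, x_{t-1}) \,\|\, \widehat{\Pr}(x_t \mid x_1, \ldots, x_{t-1})\big) = E \ln \frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le 6\epsilon$$

Proof of Corollary 2: We have

$$1 - \epsilon \le \frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} \le 1 + \epsilon$$
$$\Rightarrow 1 - \epsilon \le \frac{\widehat{\Pr}(x_t \mid x_{1:t-1})\,\widehat{\Pr}(x_{1:t-1})}{\Pr(x_t \mid x_{1:t-1})\,\Pr(x_{1:t-1})} \le 1 + \epsilon$$
$$\Rightarrow \frac{1 - \epsilon}{1 + \epsilon} \le \frac{\widehat{\Pr}(x_t \mid x_{1:t-1})}{\Pr(x_t \mid x_{1:t-1})} \le \frac{1 + \epsilon}{1 - \epsilon}$$

and using the fact that for small enough $x$ we have $\frac{1+x}{1-x} \le 1 + 3x$ and $1 - 3x \le \frac{1-x}{1+x}$, plus the fact that $\epsilon$ is small,
we have

$$1 - 3\epsilon \le \frac{\widehat{\Pr}(x_t \mid x_{1:t-1})}{\Pr(x_t \mid x_{1:t-1})} \le 1 + 3\epsilon$$
$$\Rightarrow \frac{1}{1 + 3\epsilon} \le \frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le \frac{1}{1 - 3\epsilon}$$

and using a similar fact from above that for small enough $x$, $\frac{1}{1-x} \le 1 + 2x$, we get

$$\frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le 1 + 6\epsilon$$
$$\Rightarrow \ln \frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le \ln(1 + 6\epsilon) \le 6\epsilon$$
$$\Rightarrow \sum_{x_1, \ldots, x_t} \Pr(x_1, \ldots, x_t) \ln \frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le 6\epsilon \sum_{x_1, \ldots, x_t} \Pr(x_1, \ldots, x_t)$$
$$\Rightarrow E \ln \frac{\Pr(x_t \mid x_{1:t-1})}{\widehat{\Pr}(x_t \mid x_{1:t-1})} \le 6\epsilon \qquad \Box$$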



Define $J = 2m\sqrt{\frac{2\log\frac{2m}{\delta}}{N}}$ to simplify the following statements. The proof of Theorem 1 proceeds in two steps. First,
Lemma 2 converts the sample complexity bound into more useful bounds on $\Lambda$ and $\sigma_m$. Then Lemma 3 uses
these bounds to show the theorem.



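Lemma 2. If

$$N \ge \frac{128\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 \Lambda^2 \sigma_m^4}\,\log\frac{2m}{\delta}$$

then

$$\Lambda \ge \frac{3J}{\sigma_m^2\left(\sqrt[2t+3]{1+\epsilon} - 1\right)} \qquad (9)$$
$$\sigma_m \ge 4J \qquad (10)$$

The proof is straightforward and given in the appendix.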



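Lemma 3. If equation (8) of Theorem 1 is replaced by (9) and (10), then the results of the theorem follow.

Proof of Lemma 3: Our estimator (see equation (7)) can be written as

$$\widehat{\Pr}(x_1, \ldots, x_t) = \hat\mu^T \hat\Sigma^{-1} \hat K(y_t) \hat\Sigma^{-1} \cdots \hat K(y_1) \hat\Sigma^{-1} \hat\mu$$

We can rewrite this matrix product as a sum over indices, each running from 1 to $m$.
The components $[\hat K(y)]_{a,b}$ can be written as a scalar sum as:

$$[\hat K(y)]_{a,b} = y_1 [\hat K]_{a,b,1} + y_2 [\hat K]_{a,b,2} + \cdots + y_m [\hat K]_{a,b,m}$$

So,

$$\widehat{\Pr}(x_1, \ldots, x_t) = \sum_{i_1, \ldots, i_{2t+2}}\ \sum_{j_1, \ldots, j_t} [\hat\mu]_{i_1} [\hat\Sigma^{-1}]_{i_1, i_2} [\hat K]_{i_2, i_3, j_1} [y_t]_{j_1} [\hat\Sigma^{-1}]_{i_3, i_4} [\hat K]_{i_4, i_5, j_2} [y_{t-1}]_{j_2} \cdots [\hat\Sigma^{-1}]_{i_{2t+1}, i_{2t+2}} [\hat\mu]_{i_{2t+2}}$$

This is just a sum of products of scalars. Lemma 4 (stated precisely and proven in the appendix) shows
that the errors of our estimates of all elements of $\mu$, $\Sigma^{-1}$ and $K()$ are bounded by $3J/\sigma_m^2$ with probability $1 - \delta$.

Writing $\theta$ for a generic element of $\mu$, $\Sigma^{-1}$ or $K()$ and $\hat\theta$ for its estimate, each term in the product can be rewritten as

$$\hat\theta = \theta \cdot \frac{\hat\theta}{\theta}$$

and so our products can be thought of as, instead of a product of observed quantities, the product of the
theoretical quantities times some relative error term. We can bound this relative error term for all entries,
which will allow it to factor out nicely over all summands, giving us a relative error term for our overall
probability.

Again thinking of $\theta$ as a generic item in $\mu$, $\Sigma^{-1}$ or $K()$, the above has shown that $|\hat\theta - \theta| \le 3J/\sigma_m^2$, and
so the relative error of each term is bounded as

$$1 - \frac{3J}{\sigma_m^2 |\theta|} \le \frac{\hat\theta}{\theta} \le 1 + \frac{3J}{\sigma_m^2 |\theta|}$$

which will hold for all terms with probability $1 - \delta$. Since $|\theta| \ge \Lambda$, we see that

$$1 - \frac{3J}{\sigma_m^2 \Lambda} \le \frac{\hat\theta}{\theta} \le 1 + \frac{3J}{\sigma_m^2 \Lambda}$$

Since our $\widehat{\Pr}()$ is a product of $2t + 3$ such terms, we see that

$$\left(1 - \frac{3J}{\sigma_m^2 \Lambda}\right)^{2t+3} \le \frac{\widehat{\Pr}()}{\Pr()} \le \left(1 + \frac{3J}{\sigma_m^2 \Lambda}\right)^{2t+3}$$

So by our bound on $\Lambda$, we have that

$$1 - \epsilon \le \frac{\widehat{\Pr}()}{\Pr()} \le 1 + \epsilon$$

holds with probability $1 - \delta$. □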
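The compounding step in this proof is easy to check numerically: if each of the $2t+3$ factors has relative error at most $a$, the product has relative error between $(1-a)^{2t+3}$ and $(1+a)^{2t+3}$, and choosing $a \le \sqrt[2t+3]{1+\epsilon} - 1$ keeps both ends inside $[1-\epsilon, 1+\epsilon]$. The function names below are hypothetical and the sketch only illustrates the arithmetic.

```python
def max_per_term_error(eps, t):
    """Largest per-term relative error a with (1 + a)^(2t+3) <= 1 + eps."""
    return (1.0 + eps) ** (1.0 / (2 * t + 3)) - 1.0


def ratio_bounds(per_term_error, t):
    """Lower and upper bounds on Pr_hat / Pr when each of the 2t+3 factors
    has relative error at most per_term_error."""
    n = 2 * t + 3
    return (1.0 - per_term_error) ** n, (1.0 + per_term_error) ** n
```

For example, `ratio_bounds(max_per_term_error(0.1, 5), 5)` gives an upper end of about 1.1 and a lower end slightly above 0.9, so the lower side of the bound has some slack.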

The sample complexity bound in Theorem 1 relies on knowing unobserved parameters of the problem.
To avoid this, we modify Lemma 3 to make it observable. In other words, we convert the assumptions of
sample complexity into a checkable condition.

Corollary 3. Let $X_t$ be generated by an $m \ge 2$ state HMM. Suppose we are given a $U$ which has the property
that $\mathrm{range}(O) \subseteq \mathrm{range}(U)$. Suppose we use equation (7) to estimate the probability based on $N$ independent
triples. Then with probability $1 - \delta$, if the following two inequalities hold



Kat > 12m + 



Qm 



( ^* VTTi - 1) 



'21og^ 



TV 



(?m > lOmi 



/21og^ 



N 



(11) 
(12) 



then

$$1 - \epsilon \le \frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} \le 1 + \epsilon$$



Proof:

Two technical lemmas are needed for this corollary: Lemma 4 and Lemma 5. They are stated and proved
in the supplemental material. Lemma 4 basically says that with high probability, each element of $\mu$, $\Sigma^{-1}$ and
$K()$ is estimated accurately. This is then used in Lemma 5 to show that $\Lambda$ and $\sigma_m$ are estimated accurately.

Define the event $A$ to be the set where all the estimates given in Lemma 4 hold. This event happens
with probability $1 - \delta$. On this event, since (12) gives $\hat\sigma_m \ge 5J$ and Lemma 5 gives $|\hat\sigma_m - \sigma_m| \le J$, we know $\sigma_m \ge \frac{4}{5}\hat\sigma_m$, so $\sigma_m^2 \ge \frac{16}{25}\hat\sigma_m^2$. Hence

6to 



A> 



'21og2=i 6m /21og?f 



+ 



a2„(-+^rTi-l)V N ' al\ N 
Thus on the set $A$, if (11) and (12) hold, then we see that (9) and (10) both hold and so we can apply Theorem
1. We can now use Theorem 1 to generate our claim on the accuracy of our probability bound. Technically,
this proof as given only shows that our corollary holds with probability $1 - 2\delta$. But since the set where
Theorem 1 fails is exactly $A^c$, the probability lower bound is $1 - \delta$.



The advantage of the corollary is that the left hand sides of the two conditions are observable and the
right hand sides involve known quantities. Hence one can tell whether the condition is true or not; it does not require
knowing unobserved parameters. Note that the statement is of the form $\Pr(A \Rightarrow B) \ge 1 - \delta$, so interpretation
must be done carefully.

3 Discussion: effect of $\Lambda$ and $\sigma_m$ on accuracy

As discussed above, $\sigma_m$ and $\Lambda$ have different effects on sample complexity. As $\sigma_m$ approaches zero, model
estimation becomes intrinsically hard; some problems do not admit easy estimation. In contrast, the role of $\Lambda$ in
sample complexity is more of an artifact. As $\Lambda$ approaches zero, the relative error can be arbitrarily large,
even if the estimated model is good in the sense that the probability estimates are highly accurate.

The problem with $\Lambda$ can be addressed in a couple of ways. In this section, we show that estimating a
likelihood ratio rather than the sequence probabilities gives improved relative accuracy bounds. An alternate
approach, which we do not pursue here, relies on the observation that $\Lambda$ depends on the (underspecified)
matrix $U$, and that one can thus search for a rotation and rescaling of the matrix $U$ that increases $\Lambda$.

3.1 Likelihood instead of probabilities 

Obscure words correspond to rows of the observation matrix with very small values throughout the row. If
we were interested in only estimating the probability of such a word, then these are the easy words: basically
guess zero or close to it. But, since we would like to estimate the relative probability accurately, these words
are the most challenging. Further, such small probabilities would make computing conditional probabilities
unstable since they would then become basically "0/0". Further, since the values are all small in $O$ and in $U$,
they do not significantly improve our estimates of $\mu$, $\Sigma$ and $K()$ since they are essentially zeros. Both of these
problems can be fixed by considering the problem of estimating a likelihood ratio instead of a probability.
So define:

$$\lambda(x_1, \ldots, x_t) = \frac{\Pr(x_1, x_2, \ldots, x_t)}{P_1(x_1) P_1(x_2) \cdots P_1(x_t)}$$

The $P_1(x)$ could be taken to be the marginal probability of observing $x$. It does not, in fact, have to be a
probability; any weighting which helps condition our matrix $\Sigma$ and our tensor $K()$ will do. We can then use
a modified version of $O$ and $U$ in all our existing lemmas and theorems. The precise statements of these
modified versions are in the appendix. What changes is that now $\Lambda$ is much larger and hence our relative
accuracy will be greatly improved. This fact is shown in the empirical section.
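A small numerical sketch of the likelihood ratio, dividing a sequence probability by the product of per-symbol weights $P_1(x_i)$. The function name and the numbers in the usage note are hypothetical; `p1` stands for whatever weighting is chosen, for example unigram frequencies.

```python
def likelihood_ratio(seq_prob, xs, p1):
    """lambda(x_1..x_t) = Pr(x_1..x_t) / (P_1(x_1) * ... * P_1(x_t))."""
    denom = 1.0
    for x in xs:
        denom *= p1[x]
    return seq_prob / denom
```

For two rare symbols with weights 0.01 and 0.02, a sequence probability of 0.0002 gives `likelihood_ratio(0.0002, [0, 1], [0.01, 0.02])` close to 1: the target quantity is of order one even though the raw probability is tiny, which is the conditioning effect described above.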

3.2 Empirical estimates of $\Lambda$ and $\sigma_m$

Figure 2: First graph: $\Lambda$ vs $m$, generated using vocabulary size 20,000, slope ≈ −6. Second graph: $\sigma_m$
vs $m$, generated using vocabulary size of 10,000, slope ≈ −3.2.

Figure 2 shows estimates of $\Lambda$ and $\sigma_m$, using the Internet as the corpus as summarized in the Google
n-gram dataset,^3 which contains frequencies of the most frequent 1-grams to 5-grams occurring on the web.
Details on how the figures were generated can be found in the supplementary material. As the size, $m$, of
the reduced dimension space is increased, smaller and smaller singular values, $\sigma_m$, occur in the model, and
the value $\Lambda$ of the smallest parameter in the model decreases. Empirically, both fall off with a power of
$m$, giving straight lines on the log-log plot. This data indicates a large sample complexity, the reduction of
which will be a focus of future work.

^3 http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
4 Prior work and conclusion

Recently, ideas have been proposed that push spectral learning of HMMs in several different directions.
Boots et al. (2010) provide a kernelized spectral algorithm that allows for learning an HMM in any domain
in which there exists a kernel. This allows for learning of an HMM with continuous output without the need
for discretization. Boots & Gordon (2011) provide an analogous algorithm that enables online learning for
Transformed Predictive State Representations, and hence for the setup in Hsu et al. (2009). Finally, Siddiqi
et al. (2009) directly extend Hsu et al. (2009) by relaxing the requirement that the transition matrix $T$ be
of rank $m$, instead allowing rank $k < m$, creating a Reduced-Rank HMM (RR-HMM), and then applying
the algorithm from Hsu et al. (2009) to learn the observable representation of this RR-HMM.

All of the above extensions preserve the basic structure of the tensor $B_x$, which updates the hidden state
estimate (or more precisely, a linear transformation of it) based on the most recent observation $x$. In this
paper, we replace $B_x$ with a tensor $C(y)$, which updates the hidden state estimate using a low dimensional
projection $y$ of the observation $x$. $C(y)$ contains only $m^3$ terms, in contrast to the $m^2 v$ terms contained in
$B_x$. Reducing the number of parameters to be estimated has both computational and statistical efficiency
advantages, but requires some changes to the proofs in Hsu et al. (2009). While making these changes, we
also give proofs that are simpler, that only use conditions that are checkable from the data, and that bound
the relative, rather than absolute, error.

This paper focused on the simplest case, in which HMMs have discrete states and discrete observations
and in which the observations are reduced to the same sized space as the hidden state, but our approach
can be generalized in all of the ways described above.

We have presented an improved spectral method for estimating HMMs. By using a tensor $C_y$ that
depends on the reduced rank $y$ instead of the full observed $x$ in the $B_x$ tensor used by Hsu et al. (2009),
we reduced the number of parameters to be estimated by a factor equal to the ratio of the size of the vocabulary
to the size of the hidden state. This reduction has corresponding benefits in the sample complexity.
We also showed that the sample complexity depends critically upon $\sigma_m$, the smallest singular value of the
covariance matrix $\Sigma$. As $\sigma_m$ becomes small, the HMM becomes increasingly hard to identify, and increasing
numbers of samples are needed.
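APPENDIX - SUPPLEMENTAL MATERIAL

Lemma (Restatement of Lemma 1). Assume the hidden state is of dimension $m$ and the rank of $O$ is
also $m$. Then:

$$\Pr(x_1, x_2, \ldots, x_t) = 1^T A_{x_t} A_{x_{t-1}} \cdots A_{x_1} \pi \qquad (4)$$
$$\Pr(x_1, x_2, \ldots, x_t) = b_\infty^T B_{x_t} B_{x_{t-1}} \cdots B_{x_1} b_1 \qquad (5)$$
$$\Pr(x_1, x_2, \ldots, x_t) = c_\infty^T C_{y_t} C_{y_{t-1}} \cdots C_{y_1} c_1 \qquad (6)$$

where (5) requires $U^T O$ to be invertible, and (6) requires $\mathrm{range}(O) \subseteq \mathrm{range}(U)$.

Proof:

As pointed out in the main text, Jaeger (2000) showed (4), and Hsu et al. (2009) showed (5). To show
(6), we will first write the characteristics $\mu$, $\Sigma$ and $K$ in terms of the theoretical matrices $T$, $O$, $U$ and $\pi$:

$$\mu = U^T O\,\pi$$
$$\Sigma = U^T O\; T\; \mathrm{diag}(\pi)\; O^T U$$
$$\Sigma^{-1} = (O^T U)^{-1}\, \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^T O)^{-1}$$
$$K(y) = U^T O\; T\; \mathrm{diag}(O^T U y)\; T\; \mathrm{diag}(\pi)\; O^T U$$

By definition, we have

$$c_1 \equiv \mu = U^T O\,\pi$$

likewise,

$$c_\infty^T \equiv \mu^T \Sigma^{-1} = (\pi^T O^T U)\big((O^T U)^{-1}\, \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^T O)^{-1}\big)$$
$$= \pi^T \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^T O)^{-1}$$
$$= 1^T T^{-1} (U^T O)^{-1}$$
$$= 1^T (U^T O)^{-1}$$

For $C$,

$$C(y) = K(y)\,\Sigma^{-1}$$
$$= U^T O\, T\, \mathrm{diag}(O^T U y)\, T\, \mathrm{diag}(\pi)\, O^T U\, \Sigma^{-1}$$
$$= U^T O\, T\, \mathrm{diag}(O^T U y)\, (U^T O)^{-1}$$

Note that $UU^T$ is a projection operator and since its range is the same as that of $O$ we have $O^T U U^T = O^T$.
So, if $y = U^T \delta_x$, then:

$$C(y) = U^T O\, T\, \mathrm{diag}(O^T U U^T \delta_x)\, (U^T O)^{-1}$$
$$= U^T O\, T\, \mathrm{diag}(O^T \delta_x)\, (U^T O)^{-1}$$
$$= (U^T O)\, A_x\, (U^T O)^{-1}$$

Thus (6) follows from a telescoping product.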



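Proof of Lemma 2:

The proof is simply algebraic manipulation. We have

$$N \ge \frac{128\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 \Lambda^2 \sigma_m^4}\,\log\frac{2m}{\delta}$$

which implies that

$$\Lambda^2 \ge \frac{128\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 N \sigma_m^4}\,\log\frac{2m}{\delta} \ge \frac{72\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 N \sigma_m^4}\,\log\frac{2m}{\delta}$$

and taking the square root and making the relevant substitution for $J$ we have

$$\Lambda \ge \frac{3J}{\sigma_m^2\left(\sqrt[2t+3]{1+\epsilon} - 1\right)}$$

To show the bound for $\sigma_m$ we have that

$$\sigma_m^4 \ge \frac{128\,m^2}{\left(\sqrt[2t+3]{1+\epsilon} - 1\right)^2 \Lambda^2 N}\,\log\frac{2m}{\delta}$$

and noting that $\Lambda \le 1$ and $\sqrt[2t+3]{1+\epsilon} - 1 \le 1$,

$$\sigma_m^4 \ge \frac{128\,m^2}{N}\,\log\frac{2m}{\delta}$$

Taking the square root of both sides and making the relevant substitution, we get

$$\sigma_m^2 \ge 4J$$

and since $\sigma_m \le 1$ implies $\sigma_m^2 \le \sigma_m$, we get the desired inequality. □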
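Lemma 4. The errors of our estimates of all elements of $\mu$, $\Sigma^{-1}$ and $K()$ are bounded by $3J/\sigma_m^2$ with probability $1 - \delta$,
where $J = 2m\sqrt{\frac{2\log\frac{2m}{\delta}}{N}}$.

Proof:

We first derive absolute bounds for each entry of $\mu$, $\Sigma$ and $K()$. To handle all three of them at the same
time, we will generically call any one of these three "$\theta$" and its estimate $\hat\theta$. Suppose that $\theta$ has $g$ entries, each
estimated by taking the mean of $N$ observations, all of which are bounded between $-1$ and $1$. Then, for each entry,
we have from Hoeffding (1963) that

$$\Pr(|\hat\theta_i - \theta_i| \ge \epsilon) \le 2e^{-N\epsilon^2/2}$$

and so

$$\Pr(\exists\, i \text{ s.t. } |\hat\theta_i - \theta_i| \ge \epsilon) \le 2g\,e^{-N\epsilon^2/2}$$

and setting $2g\,e^{-N\epsilon^2/2} = \delta$ we solve that $\epsilon = \sqrt{\frac{2\log\frac{2g}{\delta}}{N}}$,
so with probability $1 - \delta$ we have that

$$\forall i \quad |\hat\theta_i - \theta_i| \le \sqrt{\frac{2\log\frac{2g}{\delta}}{N}}$$

Note that $\mu$, $\Sigma$ and $K()$ are a vector, a matrix and a tensor that are estimated as $E(Y_1)$, $E(Y_2 Y_1^T)$
and $E(Y_3 Y_1^T Y_2^T)$ respectively, with $m$, $m^2$ and $m^3$ entries respectively, so the total number of entries
in all three of them is less than $m^4$. (Except in the trivial case where $m = 1$. But this corresponds to the
data being IID and so does not count as an HMM.) So all three of the following hold simultaneously with
probability $1 - \delta$:

$$\forall i \quad |\hat\mu_i - \mu_i| \le \sqrt{\frac{8\log\frac{2m}{\delta}}{N}}$$
$$\forall i,j \quad |\hat\Sigma_{ij} - \Sigma_{ij}| \le \sqrt{\frac{8\log\frac{2m}{\delta}}{N}} \qquad (13)$$
$$\forall i,j,k \quad |\hat K_{ijk} - K_{ijk}| \le \sqrt{\frac{8\log\frac{2m}{\delta}}{N}}$$

Lastly we need to bound $\hat\Sigma^{-1}$. We will start by bounding the norm of $\hat\Sigma - \Sigma$. By (13) we see
$\|\hat\Sigma - \Sigma\|_{\max} \le \sqrt{\frac{8\log\frac{2m}{\delta}}{N}}$, and
by the relationship $\|M\|_2 \le m\|M\|_{\max}$ for $m \times m$ square matrices, we get the desired result, $\|\hat\Sigma - \Sigma\|_2 \le J$.

From this bound on $\|\hat\Sigma - \Sigma\|_2$ and Lemma 20 of Hsu et al. (2009) we have that

$$|\hat\sigma_m - \sigma_m| \le J \qquad (14)$$

where $\hat\sigma_m$ is the smallest singular value of $\hat\Sigma$. By their Lemma 23 we then have that

$$\|\hat\Sigma^{-1} - \Sigma^{-1}\|_2 \le \frac{1 + \sqrt{5}}{2}\left(\frac{1}{\sigma_m - J}\right)^2 J$$

By the assumption $\sigma_m \ge 4J$, we see $\sigma_m - J \ge 3\sigma_m/4$. Thus from the algebra that $\frac{1+\sqrt{5}}{2}\left(\frac{4}{3}\right)^2 \le 3$, we see

$$\|\hat\Sigma^{-1} - \Sigma^{-1}\|_2 \le 3J/\sigma_m^2.$$

From $\|\hat\Sigma^{-1} - \Sigma^{-1}\|_{\max} \le \|\hat\Sigma^{-1} - \Sigma^{-1}\|_2$ we get our element-wise norm on the errors. Since $\sigma_m \le 1$, we see
that

$$3J/\sigma_m^2 \ge 3J = 3m\sqrt{\frac{8\log\frac{2m}{\delta}}{N}} \ge \sqrt{\frac{8\log\frac{2m}{\delta}}{N}}$$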
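The constants in this proof can be checked numerically. The sketch below computes the Hoeffding radius $\sqrt{2\log(2g/\delta)/N}$ for $g = m + m^2 + m^3$ entries and compares it with the looser bound $\sqrt{8\log(2m/\delta)/N} = J/m$ used above; it also checks the constant $\frac{1+\sqrt5}{2}(4/3)^2 \le 3$. The function names and the sample values of $m$, $N$ and $\delta$ are assumptions for illustration.

```python
import math


def hoeffding_radius(g, n, delta):
    """Simultaneous deviation bound for g means of n samples bounded in [-1, 1]."""
    return math.sqrt(2.0 * math.log(2.0 * g / delta) / n)


def J(m, n, delta):
    """J = 2m * sqrt(2 log(2m/delta) / N), as defined before Lemma 2."""
    return 2.0 * m * math.sqrt(2.0 * math.log(2.0 * m / delta) / n)


def entrywise_bound(m, n, delta):
    """The looser bound sqrt(8 log(2m/delta) / N) used in (13); equals J / m."""
    return math.sqrt(8.0 * math.log(2.0 * m / delta) / n)


# Constant used when bounding the error of the inverse: about 2.88, which is <= 3.
INVERSE_CONSTANT = (1.0 + math.sqrt(5.0)) / 2.0 * (4.0 / 3.0) ** 2
```

For any $m \ge 2$ and $\delta < 1$ the exact radius with $g = m + m^2 + m^3$ is no larger than `entrywise_bound`, because $2g/\delta \le (2m/\delta)^4$.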
Lemma 5. The estimates of $\Lambda$ and $\sigma_m$ have the following accuracy:

$$|\hat\Lambda - \Lambda| \le \frac{6m}{\sigma_m^2}\sqrt{\frac{2\log\frac{2m}{\delta}}{N}}$$
$$|\hat\sigma_m - \sigma_m| \le 2m\sqrt{\frac{2\log\frac{2m}{\delta}}{N}}$$

with probability greater than $1 - \delta$.

Proof: $\hat\Lambda$ is the empirical minimum of all the entries:

$$\hat\Lambda = \min\{\min_i |\hat\mu_i|,\ \min_{i,j} |\hat\Sigma^{-1}_{ij}|,\ \min_{i,j,k} |\hat K_{ijk}|\}$$

Since Lemma 4 bounds the accuracy of the estimate of each element of $\mu$, $\Sigma^{-1}$ and $K()$, the minimum
of these will be estimated within the same accuracy. This establishes the first inequality.

The second inequality was also established in the proof of Lemma 4, in equation (14).
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5 Likelihood ratio version of theorem [T] 



In 3.1 we considered the likelihood ratio as a way of getting a better estimator. There we used a weighting 
vector Pi which normalized our probability. In other words, 

Pr{xi,X2,-.-,Xt) 

PxiPx2 ' ' ' Pxt 

It will be a bit more mathematically convenient if we instead use qi — 1/ ^/pi instead. So, define: 

Q{xi:t) = Qixi,X2, ...,Xt) = q{xi)q{x2) ■ ■ ■ q(xt) 

Then our "likelihood ratio" is 

A(a;i, a;2, . . . , Xt) = Pr{xi,X2, • . • , Xt)Q{xi,X2, ■■■, Xt)"^ 

We will think of these g^'s as a vector and define 

O* = diag(g)0 

and 

Al = rdiag(0*^diag(r7)4) 
We will then be able to show a similar product rule as (IT]) : 

The version of this product rule we will estimate is also similar. We will define U* — dia,g{q)U and y'^ = 
U*^ dia.g{q)6xt ~ U^<ii&g{q)^5xf Our statistics are then: 

M* = E{y*i) 

E* ^ E{y;y*J) 
K*{a) = E{y^y*Jyf^)a 
22 



Defining our characteristics as before:

$$C^*(y^*) = K^*(y^*)\, \Sigma^{*-1}$$

These can also be used to estimate $\lambda$, as the following lemma shows:

Lemma 6. Assume the hidden state is of dimension $m$ and the rank of $O$ is also $m$. Then:

$$\lambda(x_1, \ldots, x_t) = \Pr(x_{1:t})\, Q^2(x_{1:t}) = c^{*\top}_\infty\, C^*(y^*_t) \cdots C^*(y^*_1)\, c^*_1 \qquad (15)$$

where the last equation requires $\mathrm{range}(O) \subset \mathrm{range}(U\, \mathrm{diag}(q))$.

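Proof:

$$\begin{aligned}
A^*_x &= T\, \mathrm{diag}(O^{*\top} \mathrm{diag}(q)\, \delta_x) \\
&= T\, \mathrm{diag}((\mathrm{diag}(q)\, O)^\top \mathrm{diag}(q)\, \delta_x) \\
&= T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, \delta_x) \\
&= T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, \mathrm{diag}(\delta_x)\, \mathbf{1}) \\
&= T\, \mathrm{diag}(O^\top \mathrm{diag}(\delta_x)\, \mathrm{diag}(q)^2\, \mathbf{1}) \\
&= T\, \mathrm{diag}(O^\top \mathrm{diag}(\delta_x)\, (q_x^2 \mathbf{1})) \\
&= T\, \mathrm{diag}(O^\top \delta_x)\, q_x^2 \\
&= A_x\, q_x^2
\end{aligned}$$

where we have used $a^\top \mathrm{diag}(\delta_x)\, b = (a^\top \delta_x)(b^\top \delta_x)$.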
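The chain of equalities can be spot-checked numerically. The sketch below is only an illustration: the two-state, four-symbol HMM, the weights `p`, and the helper names `delta`, `A` and `A_star` are all hypothetical choices, not values from the paper.

```python
import numpy as np

# Toy HMM (hypothetical numbers): m = 2 hidden states, n = 4 symbols.
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])            # T[h', h] = Pr(next = h' | current = h)
O = np.array([[0.5, 0.1],
              [0.3, 0.1],
              [0.1, 0.3],
              [0.1, 0.5]])            # O[x, h] = Pr(obs = x | state = h)
p = np.array([0.4, 0.3, 0.2, 0.1])    # arbitrary positive weights
q = 1.0 / np.sqrt(p)
n = O.shape[0]

def delta(x):
    """Indicator vector of symbol x."""
    e = np.zeros(n)
    e[x] = 1.0
    return e

def A(x):
    """A_x = T diag(O^T delta_x)."""
    return T @ np.diag(O.T @ delta(x))

def A_star(x):
    """A*_x = T diag(O*^T diag(q) delta_x) with O* = diag(q) O."""
    O_star = np.diag(q) @ O
    return T @ np.diag(O_star.T @ np.diag(q) @ delta(x))

# Largest entrywise gap between A*_x and q_x^2 A_x over all symbols.
max_gap = max(np.abs(A_star(x) - q[x] ** 2 * A(x)).max() for x in range(n))

# The identity a^T diag(delta_x) b = (a^T delta_x)(b^T delta_x).
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([5.0, 6.0, 7.0, 8.0])
lhs = a @ np.diag(delta(2)) @ b
rhs = (a @ delta(2)) * (b @ delta(2))
```

Both `max_gap` and `lhs - rhs` should vanish up to floating-point rounding, since $\mathrm{diag}(q)^2 \delta_x = q_x^2 \delta_x$.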
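Our "starred" versions can be written in terms of the basic items $T$, $O$, $U$, $\pi$ and $q$:

$$\mu^* = U^\top \mathrm{diag}(q)^2\, O\, \pi$$

$$\Sigma^* = U^\top \mathrm{diag}(q)^2\, O\, T\, \mathrm{diag}(\pi)\, O^\top \mathrm{diag}(q)^2\, U$$

$$\Sigma^{*-1} = (O^\top \mathrm{diag}(q)^2\, U)^{-1}\, \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^\top \mathrm{diag}(q)^2\, O)^{-1}$$

$$K^*(x) = U^\top \mathrm{diag}(q)^2\, O\, T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, U x)\, T\, \mathrm{diag}(\pi)\, O^\top \mathrm{diag}(q)^2\, U$$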

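So, we have

$$c^*_1 = \mu^* = U^\top \mathrm{diag}(q)^2\, O\, \pi$$

likewise,

$$\begin{aligned}
c^{*\top}_\infty &= \mu^{*\top} \Sigma^{*-1} \\
&= (\pi^\top O^\top \mathrm{diag}(q)^2\, U) \cdot \big( (O^\top \mathrm{diag}(q)^2\, U)^{-1}\, \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^\top \mathrm{diag}(q)^2\, O)^{-1} \big) \\
&= \pi^\top \mathrm{diag}(\pi)^{-1}\, T^{-1}\, (U^\top \mathrm{diag}(q)^2\, O)^{-1} \\
&= \mathbf{1}^\top T^{-1}\, (U^\top \mathrm{diag}(q)^2\, O)^{-1} \\
&= \mathbf{1}^\top (U^\top \mathrm{diag}(q)^2\, O)^{-1}
\end{aligned}$$

For $C^*$ we have

$$\begin{aligned}
C^*(y) &= U^\top \mathrm{diag}(q)^2\, O\, T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, U y) \cdot T\, \mathrm{diag}(\pi)\, O^\top \mathrm{diag}(q)^2\, U\, \Sigma^{*-1} \\
&= U^\top \mathrm{diag}(q)^2\, O\, T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, U y) \cdot (U^\top \mathrm{diag}(q)^2\, O)^{-1}
\end{aligned}$$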




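Note that $U^* U^{*\top}$ is an $n \times n$ projection operator. Since its range is the same as that of $O^*$ we have
$O^{*\top} U^* U^{*\top} = O^{*\top}$. So, if $y^* = U^{*\top} \mathrm{diag}(q)\, \delta_x$, then:

$$\begin{aligned}
C^*(y^*) &= U^\top \mathrm{diag}(q)^2\, O\, T \cdot \mathrm{diag}(O^{*\top} U^* U^{*\top} \mathrm{diag}(q)\, \delta_x) \cdot (U^\top \mathrm{diag}(q)^2\, O)^{-1} \\
&= U^\top \mathrm{diag}(q)^2\, O\, T\, \mathrm{diag}(O^\top \mathrm{diag}(q)^2\, \delta_x) \cdot (U^\top \mathrm{diag}(q)^2\, O)^{-1} \\
&= (U^\top \mathrm{diag}(q)^2\, O)\, A^*_x\, (U^\top \mathrm{diag}(q)^2\, O)^{-1}
\end{aligned}$$

Hence equation (15) follows by a telescoping product.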
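The telescoping argument can be checked on a small example. The sketch below is a minimal illustration under stated assumptions, not the procedure used for the paper's experiments: the toy HMM is hypothetical, the weights $p$ are taken to be the unigram probabilities, exact n-gram probabilities stand in for sample frequencies, and $U^*$ is built by a QR factorization so that its columns are orthonormal and span the range of $O^*$. The names `W`, `lambda_spectral` and `lambda_direct` are invented for the example.

```python
import numpy as np

# Toy HMM with hypothetical numbers: m = 2 hidden states, n = 4 symbols.
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])                      # columns sum to one
O = np.array([[0.5, 0.1],
              [0.3, 0.1],
              [0.1, 0.3],
              [0.1, 0.5]])                      # O[x, h] = Pr(x | h)
pi = np.array([0.6, 0.4])
n = O.shape[0]

# Exact unigram, bigram and trigram probabilities of the observations.
P1 = O @ pi                                     # Pr(x1)
P21 = O @ T @ np.diag(pi) @ O.T                 # Pr(x2 = i, x1 = j)
# P3[k][i, j] = Pr(x3 = i, x2 = k, x1 = j)
P3 = [O @ T @ np.diag(O[k]) @ T @ np.diag(pi) @ O.T for k in range(n)]

q = 1.0 / np.sqrt(P1)                           # assumption: p = unigram probabilities
U_star, _ = np.linalg.qr(np.diag(q) @ O)        # orthonormal basis of range(O*)
W = np.diag(q) @ U_star                         # W = diag(q)^2 U, so y*(x) = W[x]

mu = W.T @ P1                                   # mu*
Sigma = W.T @ P21 @ W                           # Sigma*
Sigma_inv = np.linalg.inv(Sigma)

def K(a):
    """K*(a): trigram statistic with the middle observation contracted against a."""
    w = W @ a
    return sum(w[k] * (W.T @ P3[k] @ W) for k in range(n))

def C(y):
    """C*(y) = K*(y) Sigma*^{-1}."""
    return K(y) @ Sigma_inv

def lambda_spectral(xs):
    """Right-hand side of (15): c_inf^T C*(y_t) ... C*(y_1) c_1."""
    v = mu
    for x in xs:
        v = C(W[x]) @ v
    return (mu @ Sigma_inv) @ v

def lambda_direct(xs):
    """Pr(x_1..x_t) Q^2(x_1..x_t) from the HMM parameters."""
    alpha = pi
    for x in xs:
        alpha = T @ (O[x] * alpha)
    return alpha.sum() * np.prod(q[list(xs)] ** 2)

xs = [0, 3, 1, 2]
lam_spec, lam_dir = lambda_spectral(xs), lambda_direct(xs)
```

The two values agree up to rounding because each $C^*(y^*)$ is a similarity transform of $A^*_x$, so the inner factors cancel in the product.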
Theorem 2. Let $X_t$ be generated by an $m \ge 2$ state HMM. Suppose we are given a $U$ which has the property
that $\mathrm{range}(O) \subset \mathrm{range}(U)$ and $|U_{ij}| \le 1$. Suppose we use equation (15) to estimate $\lambda(x_1, x_2, \ldots, x_t)$ based
on $N$ independent triples and for appropriate choice of $U^*$. Then the following two inequalities

Qm 



'21ogf^ 



2m 

^* > ^::: , / — " 5 



a* > 8m\ 



l2\og^-f 



N 



(16) 
(17) 



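(where $\sigma^*_m$ is the smallest eigenvalue of $\Sigma^*$) imply

$$1 - \epsilon \le \frac{\hat\lambda(x_1, \ldots, x_t)}{\lambda(x_1, \ldots, x_t)} \le 1 + \epsilon$$

or equivalently

$$1 - \epsilon \le \frac{\widehat{\Pr}(x_1, \ldots, x_t)}{\Pr(x_1, \ldots, x_t)} \le 1 + \epsilon$$

holds with probability at least $1 - \delta$.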



Proof:

The proof of this is identical to that given for theorem 1. The only worry is that we have defined the $y^*$'s
differently. But since we only required $|y| \le 1$, and we have constructed $|y^*| \le 1$, the Hoeffding inequality
used with elements of $U$ still holds for $U^*$.
Details of generating the graphs

In lemma 6 and theorem 2 we see that we can increase our chances of obtaining a large enough $\Lambda$ by
multiplying each row of $U$ by some function of that row. As long as we ensure that the elements of our new
$U^*$ are at most one in absolute value, we can make a claim on the accuracy of the relative "likelihood", and hence the
relative probability, generated by our sample.

Our figures utilize this gain in the size of $\Lambda$. For our corpus we use the Internet as captured by the
Google n-gram dataset. We first create a dictionary of the $v - 1$ most popular tokens, as well as an "out of
vocabulary" token, for a final dictionary of size $v$. We take $U$ to be the $U$ matrix generated by the 'thin'
SVD of the $P_{21}$ matrix generated using this vocabulary and Google 2-grams.

From this $U$ we consider the first $m$ columns. As per above, we can increase our chances of obtaining
a large enough $\Lambda$ by maximizing the size of the entries in this new $v \times m$ dimensional $U$ matrix, hence we
multiply each row by $1/\max_j(|U_{ij}|)$, ensuring that at least one of the elements in our matrix is exactly 1 or
$-1$. Now, using this new matrix $U^*$ we use the frequencies from Google 1-grams, 2-grams, and 3-grams to
compute $\mu^*$, $\Sigma^*$, and $K^*$ respectively, where each of the $v$ vocabulary words (including one out-of-vocabulary
token) corresponds to a row of $U^*$. From this, we take $\Sigma^{*-1}$ and compute the minimum element across $\mu^*$,
$\Sigma^{*-1}$ and $K^*$.

We obtain $\sigma^*_m$ in a similar way, first computing $\Sigma^*$ from the appropriate $v \times m$ dimensional $U^*$ matrix,
then taking the SVD and recording the smallest singular value.
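The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: exact n-gram probabilities from a hypothetical toy HMM stand in for the Google n-gram frequencies, the vocabulary has only $v = 4$ tokens, the observations are projected as $y = U^{*\top}\delta_x$, and "minimum element" is read as the minimum absolute value. All variable names are invented for the example.

```python
import numpy as np

# Toy stand-in for the n-gram tables (hypothetical HMM, v = 4 tokens, m = 2).
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])
O = np.array([[0.5, 0.1],
              [0.3, 0.1],
              [0.1, 0.3],
              [0.1, 0.5]])
pi = np.array([0.6, 0.4])
m = 2

P1 = O @ pi                                   # 1-gram probabilities
P21 = O @ T @ np.diag(pi) @ O.T               # P21[i, j] = Pr(x2 = i, x1 = j)
# P3[i3, i2, i1] = Pr(x3 = i3, x2 = i2, x1 = i1)
P3 = np.einsum('ia,ab,kb,bc,c,jc->ikj', O, T, O, T, pi, O)

# Thin SVD of P21; keep the first m left singular vectors.
U_full, _, _ = np.linalg.svd(P21, full_matrices=False)
U = U_full[:, :m]

# Scale each row so that its largest entry in absolute value is exactly 1.
U_star = U / np.abs(U).max(axis=1, keepdims=True)

# Projected statistics.
mu = U_star.T @ P1
Sigma = U_star.T @ P21 @ U_star
# K[a, b, c] = E(y3_a * y1_b * y2_c)
K = np.einsum('xyz,xa,zb,yc->abc', P3, U_star, U_star, U_star)
Sigma_inv = np.linalg.inv(Sigma)

# Minimum absolute element across mu*, Sigma*^{-1} and K*.
Lam = min(np.abs(mu).min(), np.abs(Sigma_inv).min(), np.abs(K).min())

# Smallest singular value of Sigma*.
sigma_m = np.linalg.svd(Sigma, compute_uv=False).min()
```

On a real vocabulary the dense trigram array would be too large to store, so the sums defining `K` would have to be accumulated directly from the sparse 3-gram counts.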